1.  Introduction


This report summarizes the main results of our research carried out at the University of
South Florida (USF) under Contract No. N00014-86-K-0392-P00001 during the period of June 1,
1986 - June 30, 1987. The project was motivated by the recognition that designing real-time
distributed computer systems (DCS's) had been largely an artistic activity and that a scientific
foundation for reliable and systematic design existed only in a weak and incoherent state. Such
a foundation has become an important research issue in computer science and engineering due to
the continuous increase in demand for ultra-reliable computer systems capable of supporting
critical real-time applications. While the long-term objective of this project was to contribute to
the establishment of a scientific foundation for designing fault-tolerant DCS's with response
time guarantees, the more specific goals of the project were the following.

(1)  Establish  a  real-time  distributed  computation  model  yielding  simple  techniques  for  response 
time  guarantee, 

(2)  Develop  DCS  architectures  possessing  effective  fault-tolerance  capability,  expandability,  and 
high  predictability  of  worst-case  performance, 

(3)  Develop  design  environments  supporting  specification  and  validation  of  real-time  behavior. 

The research reported here constituted the first phase of a two-phase project. The follow-on
phase, Phase II, was conducted at the University of California, Irvine (UCI) under Contract No.
N00014-87-K-0231 and concluded on Dec. 31, 1989. An abstract of the Phase-II research results
is included in the concluding section.

2.  Research Directions


At  the  early  stage  of  this  project,  some  design  philosophies  and  study  strategies  which 
might  distinguish  this  project  from  others  were  adopted.  They  can  be  summarized  as  follows. 

(1)  Make the processing deadlines explicitly treated attributes of both atomic and compound
computation units.

(2)  Distinguish formally between the real-time database and the archival database.

(3)  Pursue deterministic time behavior in designing communication protocols, operating system
(OS) structures, and application software.

(4)  Build the distributed clock synchronization logic into a VLSI component in a network
interface unit to achieve microsecond-level synchronization.

(5)  Explore the fault tolerance (FT) schemes that can handle in a uniform manner both hardware
faults and software faults, the latter including OS faults and application software faults.

(6)  Identify the generic forward recovery techniques applicable to hard-real-time applications as
clearly distinguished from others.

(7)  Reflect unique characteristics of tightly coupled networks (TCN's) (e.g., a radar-tracking
parallel computer system located at a single ground site) and loosely coupled networks
(LCN's) (e.g., local area networks (LAN's) in factories or wide area networks in defense
applications) in developing fault tolerance schemes and OS structures.

(8)  Develop first the real-time FT schemes that can enhance the robustness of a computing station
(a processing node executing a single application process) and then the supplementary
schemes for making a group of cooperating computing stations fault-tolerant.

(9)  Validate the formulated architectures, OS structures, and FT schemes not only via analyses
and logical reasoning/proofs but also via experimental incorporation into real-time
computer network testbeds built on TCN hardware, LCN hardware, and functional real-time
application models.

3.  Results of the Phase-I Research Conducted at USF

Main  results  obtained  during  the  reporting  period  (Phase  I)  are  as  follows. 

3.1  A  preliminary  approach  to  specification  of  timing  constraints  during  distributed 
computing  system  design  and  their  validation 

It was the premise of this research that the designs of real-time computer systems needed
in critical applications must be rigorously verified for their capability to meet the specified
deadlines. Such designing with a response time guarantee, called safe design here, depends
upon the system configuration and the scheduling strategy used, among other factors.

Specification of timing constraints is the very first step in safe design of real-time systems.
One of the timing specification approaches studied may be called the time-tagged block approach.
The basic idea is to specify timing constraints in association with execution blocks. During
Phase I, the notion of completeness of a set of timing specification primitives needed to support
the time-tagged block approach was formalized, and then a complete and practical set of primitives
was established [Yan86].
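
As an illustration only, the idea of associating a timing constraint with an execution block might be sketched as follows. The sketch is in Python, and the names TimedBlock, max_duration_s, and run are hypothetical; they are not the primitives defined in [Yan86]. The sketch merely records whether the tagged constraint was met after the block has run; it does not by itself guarantee the response time.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TimedBlock:
    """An execution block tagged with a timing constraint (hypothetical names)."""
    name: str
    body: Callable[[], Any]
    max_duration_s: float  # assumed form of constraint: upper bound on execution time

    def run(self, clock: Callable[[], float] = time.monotonic):
        """Execute the block and report (result, elapsed seconds, constraint met?)."""
        start = clock()
        result = self.body()
        elapsed = clock() - start
        return result, elapsed, elapsed <= self.max_duration_s


if __name__ == "__main__":
    block = TimedBlock("smooth_track", lambda: sum(range(1000)), max_duration_s=0.5)
    result, elapsed, met = block.run()
    print(block.name, result, met)
```

The clock is passed in as a parameter so that the same block can be exercised against a simulated time source during validation.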

Validation of time specifications is basically to check the feasibility of meeting the
specifications at run-time. A preliminary version of an overall methodology for such validation
for the case of distributed computer systems was formulated [Yan86]. The methodology uses
various analytic verification techniques in its several constituent steps. The techniques are
generally of two types: one that is machine-independent, and the other that is machine-dependent.
The machine-independent techniques detect the inconsistency in the specifications and the
impossibility of meeting the specifications, when the undesirable properties persist regardless of
machine configurations and operating strategies. The other techniques reflect the machine
characteristics in determining the execution feasibility.
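
One kind of machine-independent check can be sketched under simple assumptions. Suppose, hypothetically, that each block carries a specified minimum and maximum duration and that a compound block executes its sub-blocks sequentially. If the minimum durations of the sub-blocks add up to more than the maximum allowed for the enclosing block, the specification cannot be met on any machine. The names BlockSpec and find_inconsistencies below are illustrative and do not come from [Yan86].

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlockSpec:
    """Timing specification of a block (hypothetical form, times in milliseconds)."""
    name: str
    min_ms: float = 0.0        # specified minimum duration
    max_ms: float = math.inf   # specified maximum duration
    children: List["BlockSpec"] = field(default_factory=list)  # run sequentially


def find_inconsistencies(spec: BlockSpec) -> List[str]:
    """Report constraints that no machine configuration could satisfy."""
    problems = []
    if spec.min_ms > spec.max_ms:
        problems.append(f"{spec.name}: minimum exceeds maximum")
    if spec.children:
        total_min = sum(child.min_ms for child in spec.children)
        if total_min > spec.max_ms:
            problems.append(
                f"{spec.name}: sub-blocks need at least {total_min} ms "
                f"but only {spec.max_ms} ms is allowed"
            )
        for child in spec.children:
            problems.extend(find_inconsistencies(child))
    return problems


if __name__ == "__main__":
    track = BlockSpec("track", max_ms=10, children=[
        BlockSpec("sample", min_ms=8),
        BlockSpec("filter", min_ms=4),
    ])
    for problem in find_inconsistencies(track):
        print(problem)
```

A machine-dependent technique would go further and compare estimated execution times on a particular configuration against the same bounds.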

3.2  A  scheme  for  coordinated  execution  of  independently  designed  recoverable  distributed 
processes 

A scheme for facilitating efficient backward recovery in loosely coupled networks (LCN's)
was developed [You88]. The scheme, called the PTC/LCN (programmer-transparent coordination
/ LCN) scheme, is meant to be a fully general approach to facilitating efficient backward recovery
in LCN environments where the autonomy of each process is highly desired. It shares the same
basic design philosophy with the original PTC scheme proposed earlier in [Kim78], but was
formulated to fit LCN systems, unlike the original PTC scheme, which was better suited to
centralized systems. The scheme allows independent and uncoordinated design of error detection and
recovery capabilities of distributed processes. It makes provision for properly coordinating such
distributed processes at run-time for cooperative recovery without incurring a cyclic chain of
rollback propagations called a domino effect. The operational rules of the scheme were devised
such that a minimal number of recovery-points (RP's) were used for maintaining the capability for
recovery with minimum-distance rollbacks. These capabilities were formally proved.
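
To make the domino effect concrete, the following Python sketch shows how rollbacks can propagate when recovery points are placed without any coordination. It is not the PTC/LCN scheme; it only illustrates the problem that the scheme is designed to avoid. The model is an assumption: each process has recovery points at known times (including time 0), and a process must roll back past any message it received whose sending has been undone. The function name rollback_points is hypothetical.

```python
import math


def rollback_points(checkpoints, messages, failed, fail_time):
    """
    checkpoints: dict mapping process -> list of recovery-point times (must include 0)
    messages:    list of (sender, send_time, receiver, recv_time), all times > 0
    Returns a dict mapping each process that rolls back -> time it restarts from.
    """
    def latest_checkpoint(process, before):
        return max(t for t in checkpoints[process] if t < before)

    restart = {failed: latest_checkpoint(failed, fail_time)}
    changed = True
    while changed:  # repeat until no further rollback is forced
        changed = False
        for sender, send_time, receiver, recv_time in messages:
            send_undone = sender in restart and send_time > restart[sender]
            recv_kept = recv_time < restart.get(receiver, math.inf)
            if send_undone and recv_kept:
                # The receiver holds a message that was never sent: roll it back too.
                restart[receiver] = latest_checkpoint(receiver, recv_time)
                changed = True
    return restart


if __name__ == "__main__":
    checkpoints = {"P": [0, 50], "Q": [0, 30]}
    messages = [("P", 60, "Q", 70), ("Q", 40, "P", 45)]
    print(rollback_points(checkpoints, messages, failed="P", fail_time=100))
```

In this example a failure of P at time 100 first returns P to its recovery point at 50, which forces Q back to 30, which in turn forces P back to its start at 0, even though P had a more recent recovery point.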

The  system  design  philosophy  underlying  the  scheme  is  such  that  each  process  must  be 
solely  responsible  for  detecting  the  errors  that  it  originates.  An  approach  to  making  judicious 
exceptions,  i.e.,  utilizing  the  cooperative  error  detection  capabilities  of  processes  without  incurring 
a  domino  effect,  was  devised  in  order  to  further  enhance  the  system  robustness. 

3.3  Testbed  establishment  and  an  evaluation  of  the  DRB  scheme 

For rigorous validation of newly formulated design techniques and system structures, we
adopted the approach of testbed-based validation. The availability of low-cost building blocks
such as microcomputers and interconnection devices had made the construction of cost-effective
DCS testbeds not much more expensive than constructing pure software simulators running on
centralized computer systems. Testbeds are capable of representing the operating environment
and input scenario more accurately than software simulators.

Initial versions of three major real-time distributed computing testbeds were established,
each including a distributed real-time control program and a simulator of an application
environment with sensor devices and actuators. The three testbeds deal with three different
types of real-time object tracking applications: (1) tracking with a ground-based radar, (2) tracking
with a sensor on board a high-speed moving vehicle, and (3) cooperative tracking by sensors
distributed over multiple satellites. The first testbed was built around an in-house developed
tightly coupled microcomputer network called the Macro Dataflow Network (MDN), the second
testbed around another tightly coupled microcomputer network built by Unisys Corp., called the
Crossbar Multi-microcomputer System (CMS), and the third testbed around a local area network
manufactured by Cromemco Inc.

All  the  real-time  distributed  operating  systems  used  in  the  three  testbeds  were  developed  in 
house.  New  techniques  and  tools  were  to  be  evaluated  by  integrating  them  into  the  testbed 
facilities  and  applying  them  to  the  experimental  development  of  a  practical  network  application 
system. 

One of the design techniques evaluated by use of the MDN testbed is the distributed
recovery block (DRB) scheme initially formulated in [Kim84]. The DRB scheme is based on a
combination of both distributed concurrent processing and recovery block structuring concepts to
achieve fast forward error recovery and to treat both hardware and software faults in a uniform
manner with minimal execution overhead. It is an active redundancy scheme where multiple
processors concurrently execute multiple versions of a software component and then the same
acceptance test. The acceptance test in each processor, together with a watch-dog timer, checks
the reasonableness of the computational results of the version executed as well as the timeliness of
the execution. The scheme was incorporated into the MDN testbed, and subsequent measurement and
evaluation demonstrated the fast recovery capability of the scheme and the soundness of the
implementation strategies adopted [Yoo88].
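
The structure described above can be sketched, in much simplified form, as follows. The sketch is in Python and rests on several assumptions: two threads stand in for two processors, the role names "primary" and "shadow" and the function drb_execute are illustrative, and a shared time limit plays the part of the watch-dog timer. It is not the implementation evaluated on the MDN testbed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def drb_execute(primary, shadow, acceptance_test, x, deadline_s):
    """
    Run two versions of a component concurrently and return (role, result) for
    the first role, in priority order, whose result arrives before the deadline
    and passes the common acceptance test. Returns (None, None) if both fail.
    """
    watchdog_expiry = time.monotonic() + deadline_s
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            "primary": pool.submit(primary, x),
            "shadow": pool.submit(shadow, x),
        }
        for role in ("primary", "shadow"):
            remaining = max(0.0, watchdog_expiry - time.monotonic())
            try:
                result = futures[role].result(timeout=remaining)
            except Exception:
                continue  # the version crashed or the watch-dog timer expired
            if acceptance_test(x, result):
                return role, result  # forward recovery: no rollback and retry
    return None, None


if __name__ == "__main__":
    def faulty_sqrt(x):
        return -1.0  # simulated software fault in one version

    def accept(x, r):
        return r >= 0 and abs(r * r - x) < 1e-6

    print(drb_execute(faulty_sqrt, lambda x: x ** 0.5, accept, 9.0, deadline_s=1.0))
```

Because the second version has already been executing in parallel, its result is available as soon as the first is rejected, which is what makes the recovery fast. One limitation of this sketch is that Python threads cannot be forcibly stopped, so a version that overruns the deadline still runs to completion before the function returns.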

The  testbed  facilities  were  transferred  to  UCI  in  early  1987  and  have  since  been  upgraded 
in  major  ways.  The  current  status  of  the  testbeds  established  is  described  in  Appendix  A. 

4.  Conclusion and an Abstract of the Results of the Phase-II Research Conducted at UCI


The results summarized in this report represent some advances in the state of the art in the
design of fault-tolerant real-time DCS's. More importantly, they established directions for specific
research which were pursued more extensively during Phase II, thereby resulting in a substantial
addition to the knowledge base related to fault-tolerant real-time distributed computing. The main
results obtained during Phase II are as follows.

(1)  Identification  of  critical  research  issues  and  some  promising  research  directions  in  real-time 
fault-tolerant  distributed  computing, 

(2)  A skeleton of the foundation for realizing system-level fault tolerance, which includes among
others the DRB (distributed recovery block) scheme, the DCONV (distributed conversation)
scheme, the PTC (programmer-transparent coordination) scheme, a TB (temporary blackout)
handling scheme, and the complementary relationship among the schemes. These schemes enable
the computer system to detect and recover from both hardware and software faults without
missing the deadlines for processing important data and delivering outputs to the controlled
object/environment,

(3)  A preliminary structure of a model of real-time distributed computation,

(4)  A theoretical investigation into the efficiency and diagnostic power of basic processor-level
diagnosis approaches in diagnosing hypercubes,

(5)  An enhancement of three of the real-time computer network testbeds established in the UCI
DREAM (Distributed Real-time Ever Available Microcomputing) Laboratory.

Details  of  these  results  are  available  from  the  publications  listed  in  Section  5.2. 

Appendix A
The Current Status of the Real-Time Distributed Computing Testbeds Established


As a part of the experimental work, the PI has established a laboratory named the DREAM
Laboratory. The laboratory was started in the PI's former institution, Univ. of South Florida, in
1980 and was moved to his current institution, Univ. of California, Irvine (UCI), in January 1987.

The DREAM Lab consists of the Loosely Coupled Network (LCN) Section and the
Tightly Coupled Network (TCN) Section. There are two major testbeds established in each
section. The equipment in the LCN Section includes a Cromemco LAN connecting four
MC68000-based microcomputers and a LAN of four 80386-based PCs made by AST Research
Inc. and connected by Ethernet. The Cromemco LAN is several years old and the PC LAN was
established in 1990. These form the first testbed in the LCN Section. The operating system and
real-time application software running on the Cromemco LAN have been almost fully ported
to the PC LAN with some restructuring of the operating system. The second testbed was added in
the Fall of 1989. A 3-node ADS (Autonomous Decentralized System) made by Hitachi was
established. The ADS has a unique communication architecture called the data field that enables
easy expansion and reconfiguration. Each node is based on the M68020 and has a unique network
interface. One node runs a UNIX-ACP combination and the two other nodes run the ACP (Atom
Control Program), which is Hitachi's proprietary real-time operating system.

The TCN Section includes two major networks: (1) one called the Macro-Dataflow
Network (MDN) is a homemade network of six single-board Z8001-based microcomputers
connected through up to 12 two-port buffer memory modules and (2) the other called the Crossbar
Multi-microcomputer System (CMS) is a network of seven single-board microcomputers and five
multi-port shared memory modules connected through a crossbar connection subsystem and
manufactured by the Unisys Corporation in Huntsville, Alabama.

Real-time distributed operating systems and distributed application programs have been
developed to run on three computer networks in the DREAM Lab, i.e., the PC LAN, MDN, and
CMS, and a real-time application program was added to run under the manufacturer's operating
system of the ADS. The application programs included in the two TCN testbeds and the
PC/Cromemco testbed are distributed real-time control programs combined with simulators of
application environments with sensor devices and actuators. The three testbeds deal with
three different types of real-time object tracking applications: (1) tracking with a ground-based
radar (the MDN testbed), (2) tracking with a sensor on board a high-speed moving vehicle (the
CMS testbed), and (3) cooperative tracking by sensors distributed over multiple satellites (the
PC/Cromemco testbed, also called the Defense Satellite Network testbed). The PC/Cromemco
testbed thus deals with a WAN application although the hardware base used is a LAN. It is a kind
of dual-purpose LAN/WAN testbed.

The  three  testbeds  dealing  with  object  tracking  applications  were  used  in  conducting  the 
fault  tolerance  experiments  discussed  in  the  main  report.  Figure  1  provides  a  brief  summary  of  the 
current  status  of  the  three  testbeds. 

Quite  a  few  software  and  hardware  tools  for  prototyping  of  real-time  computer  networks 
have  been  established  in  the  UCI  DREAM  Laboratory.  They  include  operating  system 
components,  communication  primitives,  and  high  level  languages  such  as  C,  Extended  Concurrent 
Pascal, Modula-2, and Unisys PDL. There are also tools for measurement of message delays. An
approach formulated for rapid prototyping of software for real-time computer networks is a
two-step approach in which a first, inefficient version is obtained with the aid of an abstract high
level language such as Extended Concurrent Pascal or Ada, and then, using the first version as a
blueprint, the final version is written in an efficient language such as C. This approach has been
partially tested in the DREAM Laboratory with good results. The main approach established in
the  DREAM  Laboratory  for  network  performance  measurement  is  to  install  "observation  points" 
within  a  network  such  that  when  a  message  passes  through  an  observation  point,  a  time-stamped 
record  is  made  by  an  observing  machine.  By  comparing  the  time-stamped  records  made  at 
different  observation  points,  the  time  taken  for  a  message  to  travel  between  observation  points  can 
be  obtained. 
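
The comparison of time-stamped records can be sketched as follows. The record layout (message identifier, observation point, timestamp in microseconds) and the function name transit_times are assumptions made for illustration, not the format used in the DREAM Laboratory tools. The sketch also assumes that all timestamps come from a common time base, for example a single observing machine.

```python
from collections import defaultdict


def transit_times(records, src, dst):
    """
    records: iterable of (message_id, observation_point, timestamp_us)
    Returns a dict mapping message_id -> time taken to travel from src to dst,
    for every message that was observed at both points.
    """
    seen = defaultdict(dict)
    for message_id, point, timestamp_us in records:
        seen[message_id][point] = timestamp_us
    return {
        message_id: points[dst] - points[src]
        for message_id, points in seen.items()
        if src in points and dst in points
    }


if __name__ == "__main__":
    records = [
        ("m1", "A", 0), ("m1", "B", 4000),
        ("m2", "A", 10000), ("m3", "A", 20000),
        ("m2", "B", 13000),
    ]
    print(transit_times(records, "A", "B"))
```

Messages seen at only one of the two points, such as m3 above, are left out of the result rather than being reported with a guessed delay.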

PC-based facilities for graphic display of the run-time status of both the distributed
application software and the hardware configuration have been established as integral components
of both testbeds (CMS and MDN) in the TCN Section.

Real-Time Defense Computer Network Testbeds Established

•  Terminal defense controller (TDC) testbed
   -  simulates ground-based radar tracking activities
   -  built on the six-node tightly coupled microcomputer network
      called the MDN (Macro-Dataflow Network)
   -  uses an OS developed in house
   -  about 10K lines of Extended Concurrent Pascal, C, and Z8001 assembly code

•  On-board intelligence (OBI) testbed
   -  simulates interceptor-borne optical sensor and data processor activities
   -  built on the seven-node tightly coupled microcomputer network
      called the CMS (Crossbar Multi-microcomputer System)
   -  uses an OS developed in house
   -  about 5K lines of C and Z8001 assembly code

•  Defense satellite network (DSN) testbed
   -  simulates a squad of mid-course satellite-borne radar tracking processors
   -  built initially on the four-node local area network of
      Cromemco Z2/68000 microcomputers
   -  uses an OS developed in house
   -  about 10K lines of Extended Concurrent Pascal, C, and M68000 assembly code
   -  Entire software was first ported to a LAN of Intel 80386-based PC's in 1990.


Figure  1.  An  overview  of  the  three  real-time  defense  computer  networks 
established  in  the  UCI  DREAM  Lab. 